import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score,train_test_split, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score, \
recall_score, precision_score, confusion_matrix, mean_absolute_error, f1_score, cohen_kappa_score, matthews_corrcoef, classification_report
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor,GradientBoostingClassifier,BaggingRegressor,BaggingClassifier, \
AdaBoostRegressor,AdaBoostClassifier,RandomForestRegressor,RandomForestClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from pyearth import Earth
import itertools as it
import time as time
import xgboost as xgb
import re
Project Code
Team Zzz
1 Data quality check / cleaning / preparation
Put code with comments. The comments should explain the code such that it can be easily understood. You may put text (in a markdown cell) before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. Put the name of the person / persons who contributed to each code chunk / set of code chunks. An example is given below.
Our dataset originates from https://archive.ics.uci.edu/ml/datasets/Wine+Quality
To find the dataset below, click on “Data Folder” and then “winequality-white.csv”.
# Load in the original dataset for white wine
df = pd.read_csv('winequality-white.csv', sep=';')
1.1 Distribution of response
By Diego Schummer
# Histogram for the distribution of the response (quality)
plt.hist(df['quality'], bins=10)
plt.title('Histogram of Quality Ratings')
plt.xlabel('Quality Rating')
plt.ylabel('Frequency')
plt.show()
# Check the proportion of each class of 'quality' in the whole dataset
print(df['quality'].value_counts(normalize=True))
6    0.448755
5 0.297468
7 0.179665
8 0.035729
4 0.033279
3 0.004083
9 0.001021
Name: quality, dtype: float64
1.2 Data cleaning
By Diego Schummer & Yasmeen Nahas
No data cleaning was necessary since there were no missing values, and the structure and data types of the response and predictors were usable for our models (see code below).
# Checking the structure and data types
print(df.info())
# Checking for missing values
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 4898 non-null float64
1 volatile acidity 4898 non-null float64
2 citric acid 4898 non-null float64
3 residual sugar 4898 non-null float64
4 chlorides 4898 non-null float64
5 free sulfur dioxide 4898 non-null float64
6 total sulfur dioxide 4898 non-null float64
7 density 4898 non-null float64
8 pH 4898 non-null float64
9 sulphates 4898 non-null float64
10 alcohol 4898 non-null float64
11 quality 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
1.3 Data preparation
By Diego Schummer
To prepare our data for analysis, we took the following steps, which readied the data for the modeling and validation techniques used below:
We split the data into training and test sets so that we could tune each model's hyperparameters on the training set and then evaluate the models on the test set. When splitting, we stratified on the response to account for the class imbalance in the dataset, ensuring that the training and test sets have similar distributions of the target variable and yielding more reliable performance estimates.
We standardized our dataset to account for the different ranges of the predictors in the dataset.
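As a small self-contained illustration of why we pass stratify=y (synthetic labels, not our wine data): the per-class proportions in the two splits come out nearly identical.

```python
# Toy demonstration (synthetic data, not the wine dataset) that
# stratify=y preserves class proportions across train/test splits.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_toy = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.70, 0.25, 0.05]))
X_toy = pd.DataFrame({"x": rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
# The per-class proportions agree to within rounding of the split sizes
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```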
# Define the features and the target
X = df.drop('quality', axis=1)
y = df['quality']
# Split the data into train and test sets, with stratification on the 'quality' column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns, index = X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)
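A minimal sketch (synthetic data, hypothetical column name) of why the scaler is fit on the training set only and merely applied to the test set: the test set is transformed with the training statistics, so mild drift from mean 0 / std 1 on the test side is expected, and no test-set information leaks into the preprocessing.

```python
# Sketch (synthetic data): fit the scaler on train only, reuse its
# statistics on test to avoid information leakage.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
train = pd.DataFrame({"x": rng.normal(5.0, 2.0, size=200)})
test = pd.DataFrame({"x": rng.normal(5.0, 2.0, size=50)})

scaler = StandardScaler()
train_s = scaler.fit_transform(train)  # learns mean/std from train
test_s = scaler.transform(test)        # reuses the train mean/std

print(train_s.mean(), train_s.std())   # ~0 and ~1 by construction
print(test_s.mean(), test_s.std())     # close to, but not exactly, 0 and 1
```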
# Check the distribution of 'quality' in the whole dataset, the training set and the test set
print("Whole dataset:\n", y.value_counts(normalize=True))
print("\nTraining set:\n", y_train.value_counts(normalize=True))
print("\nTest set:\n", y_test.value_counts(normalize=True))
Whole dataset:
6 0.448755
5 0.297468
7 0.179665
8 0.035729
4 0.033279
3 0.004083
9 0.001021
Name: quality, dtype: float64
Training set:
6 0.448698
5 0.297601
7 0.179684
8 0.035733
4 0.033180
3 0.004084
9 0.001021
Name: quality, dtype: float64
Test set:
6 0.448980
5 0.296939
7 0.179592
8 0.035714
4 0.033673
3 0.004082
9 0.001020
Name: quality, dtype: float64
2 Exploratory data analysis
By Yasmeen Nahas
# Visualizing the distribution of the response
sns.histplot(data=df, x='quality')
# Correlation between predictors, and between predictors and the response
# Plots between predictors and the response help to identify possible outliers
sns.pairplot(df)
# Step 1: Identify outliers for each predictor
outliers = {}  # Dictionary to store outliers for each predictor
for column in df.columns:
    # Calculate z-scores for the column
    z_scores = np.abs((df[column] - df[column].mean()) / df[column].std())
    # Define a threshold for outlier detection (e.g., z-score > 3)
    threshold = 3
    # Identify outliers based on the threshold
    outliers[column] = df[z_scores > threshold]
# Step 2: Combine outliers from all predictors
combined_outliers = pd.concat(outliers.values()).drop_duplicates()
# Step 3: Determine common outliers
common_outliers = combined_outliers[combined_outliers.duplicated(keep=False)]
# Display the common outliers
print("Common Outliers:")
print(common_outliers)
Common Outliers:
Empty DataFrame
Columns: [fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality]
Index: []
There are no common outliers - the outliers are different observations for different predictors.
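The z-score rule used above can be sketched in isolation (synthetic series, not the wine data): a point more than 3 standard deviations from the mean is flagged.

```python
# Standalone sketch of the z-score outlier rule (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(10.0, 1.0, size=100), 50.0))

z = np.abs((s - s.mean()) / s.std())  # same formula as in the loop above
print(s[z > 3])  # only the injected extreme value is flagged
```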
# Checking the distribution of each individual predictor
selected_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
# Calculate the number of rows and columns based on the total number of features
num_features = len(selected_features)
num_rows = int(num_features ** 0.5)
num_cols = (num_features + num_rows - 1) // num_rows
# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
# Iterate over the selected features and plot each distribution on a different subplot
for i, feature in enumerate(selected_features):
    row = i // num_cols
    col = i % num_cols
    ax = axes[row, col] if num_rows > 1 else axes[col]
    # sns.distplot is deprecated; histplot draws the same histogram
    sns.histplot(df[feature], ax=ax, kde=False, edgecolor='black')
    ax.set_title(feature)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
corr = df.corr()
corr
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.000000 | -0.022697 | 0.289181 | 0.089021 | 0.023086 | -0.049396 | 0.091070 | 0.265331 | -0.425858 | -0.017143 | -0.120881 | -0.113663 |
| volatile acidity | -0.022697 | 1.000000 | -0.149472 | 0.064286 | 0.070512 | -0.097012 | 0.089261 | 0.027114 | -0.031915 | -0.035728 | 0.067718 | -0.194723 |
| citric acid | 0.289181 | -0.149472 | 1.000000 | 0.094212 | 0.114364 | 0.094077 | 0.121131 | 0.149503 | -0.163748 | 0.062331 | -0.075729 | -0.009209 |
| residual sugar | 0.089021 | 0.064286 | 0.094212 | 1.000000 | 0.088685 | 0.299098 | 0.401439 | 0.838966 | -0.194133 | -0.026664 | -0.450631 | -0.097577 |
| chlorides | 0.023086 | 0.070512 | 0.114364 | 0.088685 | 1.000000 | 0.101392 | 0.198910 | 0.257211 | -0.090439 | 0.016763 | -0.360189 | -0.209934 |
| free sulfur dioxide | -0.049396 | -0.097012 | 0.094077 | 0.299098 | 0.101392 | 1.000000 | 0.615501 | 0.294210 | -0.000618 | 0.059217 | -0.250104 | 0.008158 |
| total sulfur dioxide | 0.091070 | 0.089261 | 0.121131 | 0.401439 | 0.198910 | 0.615501 | 1.000000 | 0.529881 | 0.002321 | 0.134562 | -0.448892 | -0.174737 |
| density | 0.265331 | 0.027114 | 0.149503 | 0.838966 | 0.257211 | 0.294210 | 0.529881 | 1.000000 | -0.093591 | 0.074493 | -0.780138 | -0.307123 |
| pH | -0.425858 | -0.031915 | -0.163748 | -0.194133 | -0.090439 | -0.000618 | 0.002321 | -0.093591 | 1.000000 | 0.155951 | 0.121432 | 0.099427 |
| sulphates | -0.017143 | -0.035728 | 0.062331 | -0.026664 | 0.016763 | 0.059217 | 0.134562 | 0.074493 | 0.155951 | 1.000000 | -0.017433 | 0.053678 |
| alcohol | -0.120881 | 0.067718 | -0.075729 | -0.450631 | -0.360189 | -0.250104 | -0.448892 | -0.780138 | 0.121432 | -0.017433 | 1.000000 | 0.435575 |
| quality | -0.113663 | -0.194723 | -0.009209 | -0.097577 | -0.209934 | 0.008158 | -0.174737 | -0.307123 | 0.099427 | 0.053678 | 0.435575 | 1.000000 |
corr > 0.5
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | True | False | False | False | False | False | False | False | False | False | False | False |
| volatile acidity | False | True | False | False | False | False | False | False | False | False | False | False |
| citric acid | False | False | True | False | False | False | False | False | False | False | False | False |
| residual sugar | False | False | False | True | False | False | False | True | False | False | False | False |
| chlorides | False | False | False | False | True | False | False | False | False | False | False | False |
| free sulfur dioxide | False | False | False | False | False | True | True | False | False | False | False | False |
| total sulfur dioxide | False | False | False | False | False | True | True | True | False | False | False | False |
| density | False | False | False | True | False | False | True | True | False | False | False | False |
| pH | False | False | False | False | False | False | False | False | True | False | False | False |
| sulphates | False | False | False | False | False | False | False | False | False | True | False | False |
| alcohol | False | False | False | False | False | False | False | False | False | False | True | False |
| quality | False | False | False | False | False | False | False | False | False | False | False | True |
3 Developing the model: Hyperparameter tuning
Put each model in a section of its name and mention the name of the team-member tuning the model. Below is an example:
3.1 Decision Tree
By Aarti Pappu
First, I started with a basic model to better understand the ranges of the different hyperparameters of the decision tree so I could know where to start my tuning.
# Defining the object to build a classification tree
model = DecisionTreeClassifier(random_state=1)
# Fitting the classification tree to the data
model.fit(X_train, y_train)
print("Maximum number of leaves:", model.get_n_leaves())
print("Maximum depth:", model.get_depth())
print("Maximum features:", len(X_train.columns))
Maximum number of leaves: 1037
Maximum depth: 25
Maximum features: 11
Using the hyperparameter ranges obtained, as well as my intuition about other hyperparameters that could be important in tuning the tree (min_samples_leaf and min_samples_split), I did a coarse grid search to narrow the range of the optimal hyperparameters. I then plotted the results of the coarse grid search to better understand where the optimal values lie.
# Coarse grid search parameter grid
param_grid = {
'criterion':['gini','entropy'],
'max_depth': range(2,26,5),
'max_leaf_nodes': range(2,1038,100),
'max_features': range(1, 12,3),
'min_samples_leaf': range(1,10,2),
'min_samples_split': range(2,10,2)
}
# Using 2-fold CV because of the limited number of instances of some of the classes
skf = StratifiedKFold(n_splits=2)
# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)
# Make the predictions
y_pred = grid_search.predict(X_test)
# Note: estimator.score() reports accuracy for classifiers, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)
print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 8800 candidates, totalling 17600 fits
Train accuracy : 0.882
Test accuracy : 0.571
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'criterion': 'entropy', 'max_depth': 22, 'max_features': 4, 'max_leaf_nodes': 702, 'min_samples_leaf': 1, 'min_samples_split': 2}
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(3,2,figsize=(18,20))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0,1])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0,1])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0,1])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_min_samples_leaf, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0,1])
axes[1,1].set_xlabel('min_samples_leaf')
axes[1,1].set_ylabel('K-fold F1 Score')
axes[2,0].plot(cv_results.param_min_samples_split, cv_results.mean_test_f1_weighted, 'o')
axes[2,0].set_ylim([0,1])
axes[2,0].set_xlabel('min_samples_split')
axes[2,0].set_ylabel('K-fold F1 Score')
axes[2,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[2,1].set_ylim([0,1])
axes[2,1].set_xlabel('criterion')
axes[2,1].set_ylabel('K-fold F1 Score');
Since it was clear that the optimal values of min_samples_split and min_samples_leaf were their defaults (2 and 1, respectively), I did not tune them further. However, I was uncertain about the optimal values of the remaining hyperparameters, so I did a finer grid search, narrowing down the ranges for each hyperparameter based on the above graphs.
# Finer grid search
param_grid = {
'criterion':['gini','entropy'],
'max_depth': range(8,26,2),
'max_leaf_nodes': range(100,1038,50),
'max_features': range(1,12,2)
}
# Using 2-fold CV because of the limited number of instances of some of the classes
skf = StratifiedKFold(n_splits=2)
# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)
# Make the predictions
y_pred = grid_search.predict(X_test)
# Note: estimator.score() reports accuracy for classifiers, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)
print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 2052 candidates, totalling 4104 fits
Train accuracy : 0.857
Test accuracy : 0.545
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'criterion': 'gini', 'max_depth': 18, 'max_features': 1, 'max_leaf_nodes': 850}
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(2,2,figsize=(18,15))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0.4,0.6])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0.4,0.6])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0.4,0.6])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0.4,0.6])
axes[1,1].set_xlabel('criterion')
axes[1,1].set_ylabel('K-fold F1 Score');
Since it seemed clear that gini was the optimal criterion, I decided to only further tune the remaining hyperparameters (max_depth, max_leaf_nodes, and max_features) in order to find the optimal hyperparameter values.
# Finer grid search
param_grid = {
'max_depth': range(16,26),
'max_leaf_nodes': range(550,1038),
'max_features': range(1,5)
}
# Using 2-fold CV because of the limited number of instances of some of the classes
skf = StratifiedKFold(n_splits=2)
# Aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)
# Make the predictions
y_pred = grid_search.predict(X_test)
# Note: estimator.score() reports accuracy for classifiers, not F1
print('Train accuracy : %.3f'%grid_search.best_estimator_.score(X_train, y_train))
print('Test accuracy : %.3f'%grid_search.best_estimator_.score(X_test, y_test))
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)
print('Best params for F1-score')
print(grid_search.best_params_)
Fitting 2 folds for each of 19520 candidates, totalling 39040 fits
Train accuracy : 0.864
Test accuracy : 0.559
Best F1-score Through Grid Search : 0.540
Best params for F1-score
{'max_depth': 17, 'max_features': 4, 'max_leaf_nodes': 613}
Based on the optimal hyperparameters found through the grid searches, I got an optimal model:
# Decision tree classifier with the optimal parameters
dt_model = DecisionTreeClassifier(random_state=1, max_depth=17, max_features = 4, max_leaf_nodes=613)
dt_model.fit(X_train,y_train)
# Getting the feature importances of the model
feature_importance_df = pd.concat([pd.Series(dt_model.feature_names_in_), pd.Series(dt_model.feature_importances_)], axis = 1)
feature_importance_df.rename(columns={0: "predictors", 1: "feature_importance"}).sort_values(by='feature_importance', ascending=False)
|  | predictors | feature_importance |
|---|---|---|
| 7 | density | 0.137988 |
| 1 | volatile acidity | 0.109337 |
| 10 | alcohol | 0.104010 |
| 3 | residual sugar | 0.098156 |
| 8 | pH | 0.094459 |
| 0 | fixed acidity | 0.086861 |
| 2 | citric acid | 0.080260 |
| 9 | sulphates | 0.075553 |
| 5 | free sulfur dioxide | 0.075066 |
| 6 | total sulfur dioxide | 0.073783 |
| 4 | chlorides | 0.064526 |
# Predict the labels of the test set using the optimal model
y_pred = dt_model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support
3 0.00 0.00 0.00 4
4 0.21 0.21 0.21 33
5 0.59 0.61 0.60 291
6 0.58 0.60 0.59 440
7 0.52 0.49 0.51 176
8 0.46 0.34 0.39 35
9 0.00 0.00 0.00 1
accuracy 0.56 980
macro avg 0.34 0.32 0.33 980
weighted avg 0.56 0.56 0.56 980
[[ 0 0 1 3 0 0 0]
[ 0 7 15 10 1 0 0]
[ 1 12 177 90 10 1 0]
[ 1 10 92 265 64 8 0]
[ 0 1 9 74 87 5 0]
[ 0 3 4 11 5 12 0]
[ 0 0 0 0 1 0 0]]
# Compare to the base model
dt_base = DecisionTreeClassifier(random_state=1, criterion = 'entropy').fit(X_train, y_train)
y_pred = dt_base.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support
3 0.00 0.00 0.00 4
4 0.30 0.30 0.30 33
5 0.65 0.64 0.64 291
6 0.67 0.67 0.67 440
7 0.57 0.57 0.57 176
8 0.43 0.57 0.49 35
9 0.00 0.00 0.00 1
accuracy 0.62 980
macro avg 0.38 0.39 0.38 980
weighted avg 0.62 0.62 0.62 980
[[ 0 1 0 2 0 1 0]
[ 0 10 13 7 3 0 0]
[ 1 10 185 82 9 4 0]
[ 0 9 73 294 56 8 0]
[ 0 3 11 49 100 13 0]
[ 0 0 1 7 7 20 0]
[ 0 0 0 0 1 0 0]]
Despite tuning, the basic model, dt_base, has a better balance of precision, recall, and F1-score for most classes, as well as the highest overall accuracy.
# Final model
dt_model = DecisionTreeClassifier(random_state=1, criterion='entropy').fit(X_train, y_train)
3.2 Bagged Decision Trees
By Divya Bhardwaj
#Creating a baseline bagging classifier model
initialmodel = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1)).fit(X_train, y_train)
#Getting the accuracy and classification report for this model
y_pred = initialmodel.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.6408163265306123
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.56 0.30 0.39 33
5 0.67 0.67 0.67 291
6 0.63 0.72 0.67 440
7 0.58 0.49 0.54 176
8 0.86 0.54 0.67 35
9 0.00 0.00 0.00 1
accuracy 0.64 980
macro avg 0.47 0.39 0.42 980
weighted avg 0.64 0.64 0.64 980
I chose to optimize the weighted F1 score, which is usually more informative than accuracy for uneven class distributions because it balances precision and recall within each class.
I tuned max_samples, max_features, bootstrap, and bootstrap_features. For the number of trees, I used a fixed value during the grid searches, then fine-tuned it at the end once I had obtained the optimal values for the other hyperparameters.
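As a toy illustration (synthetic labels, not our data) of why weighted F1 is stricter than accuracy under imbalance: a classifier that ignores the minority class still scores 0.90 accuracy, but its weighted F1 drops below that because the minority class contributes an F1 of 0.

```python
# Toy comparison of accuracy vs. weighted F1 on imbalanced labels.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 90 + [1] * 10)  # 90% majority class
y_pred = np.zeros(100, dtype=int)       # predicts the majority class only

acc = (y_true == y_pred).mean()
w_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(acc, w_f1)  # 0.9 vs roughly 0.85
```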
params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
'n_estimators': [300],
'max_samples': [0.5, 0.75, 1.0],
'max_features': [0.5, 0.75, 1.0],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1),
param_grid =params, cv=cv, n_jobs=-1, verbose=1,
scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)
print('Best Parameters : ', bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters : {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.75, 'n_estimators': 300}
Testing the model with the optimal hyperparameters on X_test:
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=300,
random_state=1,max_features=1.0, max_samples=0.75, bootstrap=False,bootstrap_features=True).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) #accounting for the classes
print(classification_report(y_test, y_pred))
F1 Score: 0.6419354491362463
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.62 0.24 0.35 33
5 0.70 0.63 0.66 291
6 0.62 0.79 0.69 440
7 0.67 0.45 0.54 176
8 0.95 0.57 0.71 35
9 0.00 0.00 0.00 1
accuracy 0.65 980
macro avg 0.51 0.38 0.42 980
weighted avg 0.66 0.65 0.64 980
#plotting max samples and max features vs. k-fold f1 accuracy
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
fig, axes = plt.subplots(1,2, figsize=(10,5))
plt.subplots_adjust(wspace=0.2)
axes[0].plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
axes[0].set_xlabel('max_samples')
axes[0].set_ylabel('K-fold F1 Score')
axes[1].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1].set_xlabel('max_features')
axes[1].set_ylabel('K-fold F1 Score')
Based on these results, I narrowed the ranges for another grid search:
params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
'n_estimators': [300],
'max_samples': [0.65, 0.75, 0.85],
'max_features': [0.65, 0.75, 0.85],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1),
param_grid =params, cv=cv, n_jobs=-1, verbose=1,
scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)
print('Best Parameters : ', bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters : {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 0.65, 'max_samples': 0.65, 'n_estimators': 300}
Testing the model with the optimal hyperparameters on X_test:
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=300,
random_state=1,max_features=0.65, max_samples=0.65, bootstrap=False,bootstrap_features=False).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) #accounting for the classes
print(classification_report(y_test, y_pred))
F1 Score: 0.6479616424675945
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.73 0.24 0.36 33
5 0.71 0.63 0.67 291
6 0.62 0.80 0.70 440
7 0.67 0.47 0.56 176
8 1.00 0.51 0.68 35
9 0.00 0.00 0.00 1
accuracy 0.66 980
macro avg 0.53 0.38 0.42 980
weighted avg 0.67 0.66 0.65 980
#plotting max samples and max features vs. k-fold f1 accuracy
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
fig, axes = plt.subplots(1,2, figsize=(10,5))
plt.subplots_adjust(wspace=0.2)
axes[0].plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
axes[0].set_xlabel('max_samples')
axes[0].set_ylabel('K-fold F1 Score')
axes[1].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1].set_xlabel('max_features')
axes[1].set_ylabel('K-fold F1 Score')
Based on these results, I narrowed the ranges for another grid search:
params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
'n_estimators': [300],
'max_samples': [0.6, 0.65, 0.7],
'max_features': [0.6, 0.65, 0.7],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1),
param_grid =params, cv=cv, n_jobs=-1, verbose=1,
scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)
print('Best Parameters : ', bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters : {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 0.65, 'max_samples': 0.65, 'n_estimators': 300}
This gave the same model as previously, so I decided to use absolute values instead of proportions to tune more finely. In the case of max_features, the previous grid searches' optimal values all corresponded to 7 features, so I fixed that value.
params = {'base_estimator': [DecisionTreeClassifier(random_state = 1)],
'n_estimators': [300],
'max_samples': [2400, 2500, 2600],
'max_features': [7],
'bootstrap': [True, False],
'bootstrap_features': [True, False]}
cv = KFold(n_splits=5,shuffle=True,random_state=1)
bagging_classifier_grid = GridSearchCV(BaggingClassifier(random_state=1, n_jobs=-1),
param_grid =params, cv=cv, n_jobs=-1, verbose=1,
scoring = ['f1_weighted'], refit='f1_weighted')
bagging_classifier_grid.fit(X_train, y_train)
print('Best Parameters : ', bagging_classifier_grid.best_params_)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Parameters : {'base_estimator': DecisionTreeClassifier(random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 7, 'max_samples': 2500, 'n_estimators': 300}
#plotting max samples vs. k-fold f1 accuracy
cv_results = pd.DataFrame(bagging_classifier_grid.cv_results_)
plt.plot(cv_results.param_max_samples, cv_results.mean_test_f1_weighted, 'o')
plt.xlabel('max_samples')
plt.ylabel('K-fold F1 Score')
Testing the optimal model on X_test, increasing the number of trees manually to find the optimal value since the rest of the hyperparameters are now fixed:
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=650,
random_state=1,max_features= 7, max_samples=2500, bootstrap=False,bootstrap_features=False).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 Score:", f1_score(y_test, y_pred, average='weighted')) #accounting for the classes
print(classification_report(y_test, y_pred))
F1 Score: 0.6505479910297426
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.64 0.21 0.32 33
5 0.70 0.64 0.67 291
6 0.62 0.80 0.70 440
7 0.69 0.47 0.56 176
8 1.00 0.54 0.70 35
9 0.00 0.00 0.00 1
accuracy 0.66 980
macro avg 0.52 0.38 0.42 980
weighted avg 0.67 0.66 0.65 980
This is the overall optimal bagging model in terms of F1 score. Its overall accuracy and precision/recall balance are also optimal for most classes. Looking at its feature importances:
total = [0] * 11
for i, temp_model in enumerate(model.estimators_):
    feature = model.estimators_features_[i]
    for j, predictor in enumerate(feature):
        total[predictor] += temp_model.feature_importances_[j]
feature_importances = np.array(total)/len(model.estimators_)
importance_df = pd.concat([pd.Series(model.feature_names_in_), pd.Series(feature_importances)], axis = 1)
importance_df.sort_values(by=1, ascending=False)
|  | 0 | 1 |
|---|---|---|
| 10 | alcohol | 0.114615 |
| 7 | density | 0.108228 |
| 6 | total sulfur dioxide | 0.093421 |
| 8 | pH | 0.092443 |
| 5 | free sulfur dioxide | 0.091525 |
| 1 | volatile acidity | 0.091202 |
| 3 | residual sugar | 0.090663 |
| 4 | chlorides | 0.084718 |
| 9 | sulphates | 0.080082 |
| 2 | citric acid | 0.079759 |
| 0 | fixed acidity | 0.073344 |
For the stacking ensemble later on, I converted max_samples to a fraction of the training set size so that it could be reused there.
2500/(X_train.shape[0])
0.6380806533945891
# Final model
bg_model = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1), n_estimators=650,
random_state=1, max_features=7, max_samples=0.6380806533945891, bootstrap=False, bootstrap_features=False).fit(X_train, y_train)
3.3 Random Forest
By Diego Schummer
# Create a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model using the training sets
rf.fit(X_train, y_train.values.ravel())
# Predict the response for the test dataset
y_pred = rf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, y_pred))
# Detailed report of classification done by the model
print(classification_report(y_test, y_pred))Accuracy: 0.6755102040816326
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.70 0.21 0.33 33
5 0.71 0.66 0.68 291
6 0.64 0.82 0.72 440
7 0.71 0.48 0.58 176
8 0.95 0.54 0.69 35
9 0.00 0.00 0.00 1
accuracy 0.68 980
macro avg 0.53 0.39 0.43 980
weighted avg 0.68 0.68 0.66 980
In the parameter grid, I’ve added ‘class_weight’ to handle the class imbalance. It adjusts the cost function to give more weight to under-represented classes and less weight to over-represented ones. ‘balanced’ and ‘balanced_subsample’ automatically adjust weights inversely proportional to class frequencies. The ‘None’ option is added to compare the performance with and without class weighting.
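The 'balanced' heuristic can be reproduced directly: each class weight is n_samples / (n_classes * count_per_class). A minimal sketch with illustrative labels (not the wine data), assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# illustrative, imbalanced labels (not the wine data)
y = np.array([5, 5, 5, 5, 6, 6, 7])

classes = np.unique(y)
# 'balanced' weights: n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y)[classes])

# scikit-learn computes the same thing
auto = compute_class_weight(class_weight='balanced', classes=classes, y=y)

print(dict(zip(classes, manual)))  # rarer classes get larger weights
```

Rare classes get proportionally larger weights, which is why 'balanced' can help with the under-represented quality scores.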
### Tuning
# Define the parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'criterion': ['gini', 'entropy'],
'max_depth': [None, 10, 20, 30, 40],
'min_samples_split': [2, 10, 20],
'min_samples_leaf': [1, 5, 10],
'class_weight' : ['balanced', 'balanced_subsample', None]
}
# Create a GridSearchCV object
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=2, scoring='accuracy')
# Fit the GridSearch to our training data
grid_search.fit(X_train, y_train.values.ravel())
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
# Create a new Random Forest classifier with the best parameters
rf_best = RandomForestClassifier(**grid_search.best_params_, random_state=42)
# Fit the model on the training data
rf_best.fit(X_train, y_train.values.ravel())
# Predict the labels of the test set
y_pred = rf_best.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))Best parameters: {'class_weight': None, 'criterion': 'entropy', 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best score: 0.6347626339969372
precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.67 0.24 0.36 33
5 0.71 0.65 0.68 291
6 0.63 0.79 0.70 440
7 0.67 0.49 0.57 176
8 1.00 0.54 0.70 35
9 0.00 0.00 0.00 1
accuracy 0.66 980
macro avg 0.53 0.39 0.43 980
weighted avg 0.67 0.66 0.66 980
[[ 0 0 1 3 0 0 0]
[ 0 8 16 9 0 0 0]
[ 0 3 189 99 0 0 0]
[ 0 1 58 349 32 0 0]
[ 0 0 3 87 86 0 0]
[ 0 0 0 7 9 19 0]
[ 0 0 0 0 1 0 0]]
#Feature Importance
feature_importances = pd.DataFrame(rf_best.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)- Class 6: with precision around 0.63 and recall around 0.79, this class is the best predicted by the model.
- The accuracy of the model is 0.66, i.e., the proportion of correctly classified test observations.
- The model performs decently for classes 5, 6, and 7 but poorly for the rare classes.
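The gap between the macro and weighted averages in these reports comes from the rare classes: macro averaging counts every class equally, while weighted averaging scales by support. A small illustrative sketch (toy labels, not the wine data):

```python
from sklearn.metrics import f1_score

# toy imbalanced labels: two common classes, one rare class that is always missed
y_true = [5] * 8 + [6] * 8 + [9]
y_pred = [5] * 8 + [6] * 8 + [5]

macro = f1_score(y_true, y_pred, average='macro')        # every class counts equally
weighted = f1_score(y_true, y_pred, average='weighted')  # classes weighted by support

print(round(macro, 3), round(weighted, 3))
```

Missing the single rare observation drags the macro F1 far below the weighted F1, mirroring the pattern seen for classes 3 and 9 above.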
# Create a new Random Forest classifier with the best parameters
rf_best = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=20, min_samples_leaf=1,
min_samples_split=2, class_weight=None, random_state=42)
# Fit the model on the training data
rf_best.fit(X_train, y_train.values.ravel())
# Predict the labels of the test set
y_pred = rf_best.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))
# Print the confusion matrix
print(confusion_matrix(y_test, y_pred)) precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.64 0.21 0.32 33
5 0.70 0.66 0.68 291
6 0.64 0.80 0.71 440
7 0.72 0.49 0.59 176
8 1.00 0.57 0.73 35
9 0.00 0.00 0.00 1
accuracy 0.67 980
macro avg 0.53 0.39 0.43 980
weighted avg 0.68 0.67 0.66 980
[[ 0 0 1 3 0 0 0]
[ 0 7 18 8 0 0 0]
[ 0 3 191 96 1 0 0]
[ 0 1 60 354 25 0 0]
[ 0 0 2 87 87 0 0]
[ 0 0 1 8 6 20 0]
[ 0 0 0 0 1 0 0]]
# compare to base model
#create base model
rf_base = RandomForestClassifier(n_estimators=100, random_state=42)
#fit base model
rf_base.fit(X_train, y_train.values.ravel())
#predict base model
y_pred = rf_base.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred)) precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.70 0.21 0.33 33
5 0.71 0.66 0.68 291
6 0.64 0.82 0.72 440
7 0.71 0.48 0.58 176
8 0.95 0.54 0.69 35
9 0.00 0.00 0.00 1
accuracy 0.68 980
macro avg 0.53 0.39 0.43 980
weighted avg 0.68 0.68 0.66 980
best_parameters = {'class_weight': None, 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
#fit model
rf_model = RandomForestClassifier(**best_parameters, random_state=42)
#fit model
rf_model.fit(X_train, y_train.values.ravel())
#predict model
y_pred = rf_model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred)) precision recall f1-score support
3 0.00 0.00 0.00 4
4 0.67 0.24 0.36 33
5 0.69 0.65 0.67 291
6 0.62 0.78 0.69 440
7 0.67 0.47 0.55 176
8 1.00 0.54 0.70 35
9 0.00 0.00 0.00 1
accuracy 0.66 980
macro avg 0.52 0.38 0.42 980
weighted avg 0.66 0.66 0.65 980
The base model, rf_base, has the best balance of precision, recall, and F1-score for most classes, as well as the highest overall accuracy.
print("RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=None, min_samples_leaf=1, min_samples_split=2, class_weight=None, random_state=42)")RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=None, min_samples_leaf=1, min_samples_split=2, class_weight=None, random_state=42)
3.4 XGBoost
By Yasmeen Nahas
To take class imbalance into account, I implement an algorithm that rebalances the sample weights of the individual classes in the XGBoost model.
First, the class labels need to be remapped to consecutive integers starting at 0, since XGBoost does not accept the original 3-9 quality scores as multiclass labels.
# mapping the variables from 0 to 7 (the model does not accept the standard format)
num = {3: 0,
4: 1,
5: 2,
6: 3,
7: 4,
8: 5,
9: 6}
y_train = y_train.map(num)
y_test = y_test.map(num)# taking weights into account using an algorithm that rebalances the weights
def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
# taking the unique classes
classes = np.unique(y_train, axis = 0)
# sorting the classes depending on the number of observations in ascending order
classes.sort()
#This line uses the np.bincount() function to count the number of occurrences of each class in the y_train array. The resulting counts are stored in the class_samples array.
class_samples = np.bincount(y_train)
# The total number of samples in y_train is calculated by summing up all the counts in class_samples.
total_samples = class_samples.sum()
# The variable n_classes stores the number of unique classes found in y_train.
n_classes = len(class_samples)
# This line calculates the weight for each class by dividing the total number of samples by the product of the number of classes, the class samples, and 1.0 (used to ensure floating-point division).
weights = total_samples / (n_classes * class_samples * 1.0)
# The class_weight_dict dictionary is created, mapping each class to its corresponding weight. This is done by pairing up the classes array with the weights array using the zip() function.
class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
# The weight of the second class (index 1 in classes) is multiplied by largest_class_weight_coef and updated in the class_weight_dict.
class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
# This line creates a list called sample_weights by iterating over each value y in y_train and retrieving the corresponding weight from class_weight_dict.
sample_weights = [class_weight_dict[y] for y in y_train]
return sample_weightsclasses = np.unique(y_train, axis = 0)
# sorting the classes depending on the number of observations in ascending order
classes.sort()
#This line uses the np.bincount() function to count the number of occurrences of each class in the y_train array. The resulting counts are stored in the class_samples array.
class_samples = np.bincount(y_train)
# The total number of samples in y_train is calculated by summing up all the counts in class_samples.
total_samples = class_samples.sum()
# The variable n_classes stores the number of unique classes found in y_train.
n_classes = len(class_samples)
# This line calculates the weight for each class by dividing the total number of samples by the product of the number of classes, the class samples, and 1.0 (used to ensure floating-point division).
weights = total_samples / (n_classes * class_samples * 1.0)
class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
class_weight_dict{0: 34.982142857142854,
1: 4.305494505494505,
2: 0.48002940455770643,
3: 0.31838127742564604,
4: 0.7950487012987013,
5: 3.9979591836734696,
6: 139.92857142857142}
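A quick sanity check on these weights: with w_c = N / (K * n_c), every class contributes the same total weight N / K before the largest-class adjustment. A sketch with illustrative counts (pure NumPy, not the wine data):

```python
import numpy as np

y = np.array([0] * 4 + [1] * 2 + [2] * 1)  # illustrative class counts: 4, 2, 1
counts = np.bincount(y)
n_classes = len(counts)
weights = len(y) / (n_classes * counts)    # same formula as in the function above

# total weight contributed by each class: counts[c] * weights[c]
per_class_total = counts * weights
print(per_class_total)                     # every class contributes len(y) / n_classes
```

So under these weights each class carries equal total influence on the loss, regardless of its frequency.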
# identifying the largest class coefficient
largest_class_weight_coef = max(y_train.value_counts().values)/y_train.shape[0]
#pass y_train as numpy array
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)classes = np.unique(y_train, axis = 0)
# sorting the classes depending on the number of observations in ascending order
classes.sort()
#This line uses the np.bincount() function to count the number of occurrences of each class in the y_train array. The resulting counts are stored in the class_samples array.
class_samples = np.bincount(y_train)
# The total number of samples in y_train is calculated by summing up all the counts in class_samples.
total_samples = class_samples.sum()
# The variable n_classes stores the number of unique classes found in y_train.
n_classes = len(class_samples)
# This line calculates the weight for each class by dividing the total number of samples by the product of the number of classes, the class samples, and 1.0 (used to ensure floating-point division).
weights = total_samples / (n_classes * class_samples * 1.0)
class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
class_weight_dict{0: 34.982142857142854,
1: 4.305494505494505,
2: 0.48002940455770643,
3: 0.31838127742564604,
4: 0.7950487012987013,
5: 3.9979591836734696,
6: 139.92857142857142}
class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
sample_weights = [class_weight_dict[y] for y in y_train]The algorithm adjusts the weight of each observation according to its class. Next, I use a base model as a starting point for tuning:
# Create XGBoost classifier with the custom evaluation metric
model = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
seed=45)
model.fit(X_train, y_train, sample_weight = weight)XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False,
eval_metric=<function f1_score at 0x7fc20a768820>,
feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=None, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100, n_jobs=None,
num_class=7, num_parallel_tree=None, objective='multi:softmax', ...)
pd.concat([pd.Series(X_train.columns, name = 'predictor'),
pd.Series(model.feature_importances_,
name = 'importance')], axis = 1).sort_values(by = 'importance', ascending=False)
model.score(X_test, y_test)0.6275510204081632
The base model achieves a test accuracy of 62.8%. Next, I tune the hyperparameters.
The objective multi:softmax is used since we are predicting multiple classes, with num_class set to 7. For cross_val_score, the scoring metric is set to 'f1_weighted'. I examined each hyperparameter separately to find promising ranges:
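Every sweep below repeats the same get_models / evaluate_model scaffold with one hyperparameter varied at a time. The shared pattern, as a hedged sketch (synthetic data and a stand-in classifier so it runs without xgboost; the real sweeps substitute XGBClassifier and pass the sample weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (the real code uses X_train / y_train)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for depth in [2, 4, 8]:  # the hyperparameter being swept
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    scores = cross_val_score(model, X, y, scoring='f1_weighted', cv=cv)
    results[depth] = (scores.mean(), scores.std())
    print('>%s %.3f (%.3f)' % (depth, scores.mean(), scores.std()))
```

A coarse sweep narrows the range, then a finer sweep around the best values refines it; that two-pass structure is reused for each hyperparameter below.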
3.4.1 1. Number of Trees
# Coarse grid search
def get_models():
models = dict()
# define number of trees to consider
n_trees = [5, 10, 50, 100, 500, 1000, 2000, 5000]
for n in n_trees:
models[str(n)] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
n_estimators=n,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Number of trees',fontsize=15)>5 0.467 (0.015)
>10 0.509 (0.008)
>50 0.613 (0.016)
>100 0.639 (0.017)
>500 0.634 (0.019)
>1000 0.631 (0.017)
>2000 0.626 (0.016)
>5000 0.623 (0.015)
# Finer grid search
def get_models():
models = dict()
# define number of trees to consider
n_trees = [60, 80, 100, 150, 200]
for n in n_trees:
models[str(n)] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
n_estimators=n,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Number of trees',fontsize=15)>60 0.620 (0.020)
>80 0.633 (0.020)
>100 0.639 (0.017)
>150 0.634 (0.016)
>200 0.635 (0.018)
Performance plateaus between roughly 80 and 200 trees, so values in this range are carried into the combined grid search.
3.4.2 2. Depth
# Coarse grid search
# get a list of models to evaluate
def get_models():
models = dict()
# explore depths from 1 to 10
depth = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
for n in depth:
models[str(n)] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
max_depth=n,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Depth',fontsize=15)>3 0.546 (0.008)
>4 0.584 (0.005)
>5 0.617 (0.013)
>6 0.639 (0.017)
>7 0.636 (0.014)
>8 0.636 (0.018)
>9 0.634 (0.011)
>10 0.640 (0.017)
>11 0.643 (0.013)
>12 0.648 (0.016)
>13 0.636 (0.018)
# Finer grid search
# get a list of models to evaluate
def get_models():
models = dict()
# explore depths from 1 to 10
depth = [6, 7, 8, 9, 10]
for n in depth:
models[str(n)] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
max_depth=n,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Depth',fontsize=15)>6 0.639 (0.017)
>7 0.636 (0.014)
>8 0.636 (0.018)
>9 0.634 (0.011)
>10 0.640 (0.017)
Tree depths of around [7, 9, 10] seem ideal.
3.4.3 3. Learning Rate
# Coarse grid search
# get a list of models to evaluate
def get_models():
models = dict()
for i in [0.01,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.8,1.0]:
key = '%.4f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
learning_rate= i,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Learning Rate',fontsize=15)>0.0100 0.459 (0.019)
>0.0500 0.538 (0.018)
>0.1000 0.581 (0.016)
>0.2000 0.615 (0.012)
>0.3000 0.639 (0.017)
>0.4000 0.628 (0.017)
>0.5000 0.629 (0.017)
>0.6000 0.627 (0.011)
>0.8000 0.625 (0.014)
>1.0000 0.631 (0.014)
# Finer grid search
# get a list of models to evaluate
def get_models():
models = dict()
for i in [0.1, 0.15, 0.2, 0.25,0.3, 0.35, 0.4]:
key = '%.4f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
learning_rate= i,
random_state = 1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('Learning Rate',fontsize=15)>0.1000 0.581 (0.016)
>0.1500 0.604 (0.019)
>0.2000 0.615 (0.012)
>0.2500 0.621 (0.016)
>0.3000 0.639 (0.017)
>0.3500 0.625 (0.015)
>0.4000 0.628 (0.017)
Learning rates of [0.2, 0.3, 0.35] seem ideal.
3.4.4 4. Reg Lambda
# Coarse grid search
def get_models():
models = dict()
for i in [0,0.5,1.0,1.5,2,10,100, 500, 1000]:
key = '%.2f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
reg_lambda=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('reg_lambda',fontsize=15)>0.00 0.632 (0.018)
>0.50 0.627 (0.017)
>1.00 0.639 (0.017)
>1.50 0.626 (0.009)
>2.00 0.628 (0.024)
>10.00 0.615 (0.014)
>100.00 0.562 (0.011)
>500.00 0.500 (0.015)
>1000.00 0.472 (0.020)
# Finer grid search
def get_models():
models = dict()
for i in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 10]:
key = '%.1f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
reg_lambda=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('reg_lambda',fontsize=15)>0.5 0.627 (0.017)
>1.0 0.639 (0.017)
>1.5 0.626 (0.009)
>2.0 0.628 (0.024)
>2.5 0.626 (0.009)
>3.0 0.627 (0.010)
>5.0 0.625 (0.009)
>10.0 0.615 (0.014)
reg_lambda values of [1.0, 2.5, 3.0] seem ideal.
3.4.5 5. Gamma
# Coarse grid search
def get_models():
models = dict()
for i in [0,10,1e2,1e3,1e4,1e5,1e6,1e7,1e8,1e9]:
key = '%.0f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
gamma=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)>0 0.639 (0.017)
>10 0.439 (0.030)
>100 0.285 (0.066)
>1000 0.010 (0.018)
>10000 0.010 (0.018)
>100000 0.010 (0.018)
>1000000 0.010 (0.018)
>10000000 0.010 (0.018)
>100000000 0.010 (0.018)
>1000000000 0.010 (0.018)
# Finer grid search
def get_models():
models = dict()
for i in [0, 1, 2, 3, 4, 5]:
key = '%.1f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
gamma=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)>0.0 0.639 (0.017)
>1.0 0.567 (0.009)
>2.0 0.522 (0.015)
>3.0 0.500 (0.015)
>4.0 0.475 (0.014)
>5.0 0.466 (0.010)
# Finer grid search
def get_models():
models = dict()
for i in [0, 0.1, 0.2, 0.3, 0.4, 0.5]:
key = '%.1f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
gamma=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)>0.0 0.639 (0.017)
>0.1 0.630 (0.014)
>0.2 0.629 (0.018)
>0.3 0.617 (0.017)
>0.4 0.606 (0.015)
>0.5 0.598 (0.009)
# Finer grid search
def get_models():
models = dict()
for i in [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]:
key = '%.3f' % i
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
gamma=i,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('gamma',fontsize=15)>0.000 0.639 (0.017)
>0.010 0.629 (0.022)
>0.020 0.634 (0.015)
>0.030 0.631 (0.014)
>0.040 0.624 (0.013)
>0.050 0.629 (0.017)
>0.060 0.630 (0.018)
>0.070 0.637 (0.011)
>0.080 0.630 (0.014)
>0.090 0.631 (0.020)
>0.100 0.630 (0.014)
Gamma values of [0.07, 0.08] seem ideal.
3.4.6 6. Subsample
# Coarse grid search
def get_models():
models = dict()
sub = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
for s in sub:
key = '%.2f' % s
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
subsample = s,
random_state=1)
return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
# define the evaluation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
fit_params = {"sample_weight": sample_weights}
# evaluate the model and collect the results
scores = (cross_val_score(model, X, y, fit_params = fit_params, scoring= 'f1_weighted', cv=cv, n_jobs=-1))
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X_train, y_train, weight)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score',fontsize=15)
plt.xlabel('subsample',fontsize=15)>0.10 0.555 (0.011)
>0.20 0.596 (0.022)
>0.30 0.609 (0.015)
>0.40 0.627 (0.009)
>0.50 0.623 (0.015)
>0.60 0.634 (0.019)
>0.70 0.636 (0.006)
>0.80 0.635 (0.013)
>0.90 0.635 (0.010)
>1.00 0.639 (0.017)
# Finer grid search
def get_models():
models = dict()
sub = [0.7, 0.75, 0.8, 0.85, 0.9]
for s in sub:
key = '%.2f' % s
models[key] = xgb.XGBClassifier(objective='multi:softmax',
num_class=7,
eval_metric=f1_score,
subsample = s,
random_state=1)
return models
# Evaluate a given model using cross-validation
def evaluate_model(model, X, y, sample_weights):
    # Define the evaluation procedure
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    fit_params = {"sample_weight": sample_weights}
    # Evaluate the model and collect the results
    scores = cross_val_score(model, X, y, fit_params=fit_params, scoring='f1_weighted', cv=cv, n_jobs=-1)
    return scores
# Get the models to evaluate
models = get_models()
# Evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # Evaluate the model
    scores = evaluate_model(model, X_train, y_train, weight)
    # Store the results
    results.append(scores)
    names.append(name)
    # Summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# Plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('Cross validation score', fontsize=15)
plt.xlabel('subsample', fontsize=15)
>0.70 0.636 (0.006)
>0.75 0.637 (0.013)
>0.80 0.635 (0.013)
>0.85 0.633 (0.020)
>0.90 0.635 (0.010)
[Boxplot of cross-validation F1 scores for each subsample value]
Based on the coarse and finer searches, the most promising subsample values are 0.7, 0.85, and 0.9, so these are carried into the full grid search below.
param_grid = {'n_estimators': [80],
              'max_depth': [8, 9],
              'learning_rate': [0.2, 0.3, 0.35],
              'gamma': [0.07, 0.08],
              'reg_lambda': [1.0, 2.5, 3.0],
              'subsample': [0.7, 0.85, 0.9]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
optimal_params = GridSearchCV(estimator=xgb.XGBClassifier(objective='multi:softmax',
                                                          num_class=7,
                                                          random_state=1),
                              param_grid=param_grid,
                              scoring='f1_weighted',
                              verbose=1,
                              n_jobs=-1,
                              cv=cv)
optimal_params.fit(X_train, y_train, sample_weight=weight)
print(optimal_params.best_params_, optimal_params.best_score_)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
{'gamma': 0.07, 'learning_rate': 0.2, 'max_depth': 9, 'n_estimators': 80, 'reg_lambda': 1.0, 'subsample': 0.7} 0.648089065194428
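Beyond a fixed grid, hyperparameters can also be refined with a randomized search, which samples combinations from distributions instead of enumerating every cell. A minimal self-contained sketch of this pattern on synthetic data, using sklearn's GradientBoostingClassifier as a stand-in for XGBoost (the exact further-tuning run is not shown in this notebook, so this is illustrative only):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic multi-class data standing in for the wine features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)

# Distributions to sample from, rather than a fixed grid
param_dist = {'n_estimators': randint(50, 150),
              'learning_rate': uniform(0.05, 0.3),   # samples from [0.05, 0.35]
              'max_depth': randint(3, 10),
              'subsample': uniform(0.6, 0.4)}        # samples from [0.6, 1.0]

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                            param_distributions=param_dist,
                            n_iter=5,  # small for illustration; use more in practice
                            scoring='f1_weighted',
                            cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
                            random_state=1, n_jobs=-1)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The distributions bracket the values found promising by the grid searches above; the search budget (`n_iter`) trades compute for coverage.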
Further tuning beyond this grid was done to obtain the final hyperparameters:
xgb_model = xgb.XGBClassifier(objective='multi:softmax', random_state=1, gamma=0.06, learning_rate=0.3,
                              max_depth=9, n_estimators=150, reg_lambda=1.0, subsample=0.8,
                              num_class=7, eval_metric=f1_score)
xgb_model.fit(X_train, y_train, sample_weight=weight)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False,
eval_metric=<function f1_score at 0x7fc20a768820>,
feature_types=None, gamma=0.06, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.3, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=9,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=150, n_jobs=None,
num_class=7, num_parallel_tree=None, objective='multi:softmax', ...)
y_pred = xgb_model.predict(X_test)
# Print the precision and recall, among other metrics
print(classification_report(y_test, y_pred, digits=3))
# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
# Metrics on Test Data
print("Accuracy on Test Data: ", accuracy_score(y_test, y_pred))
print("Precision on Test Data: ", precision_score(y_test, y_pred, average='weighted'))
print("Recall on Test Data: ", recall_score(y_test, y_pred, average='weighted'))
print("F1-score on Test Data: ", f1_score(y_test, y_pred, average='weighted'))
              precision    recall  f1-score   support
0 0.000 0.000 0.000 4
1 0.444 0.364 0.400 33
2 0.684 0.646 0.664 291
3 0.664 0.723 0.692 440
4 0.600 0.580 0.590 176
5 0.741 0.571 0.645 35
6 0.000 0.000 0.000 1
accuracy 0.653 980
macro avg 0.448 0.412 0.427 980
weighted avg 0.650 0.653 0.650 980
[[ 0 1 1 2 0 0 0]
[ 0 12 14 6 1 0 0]
[ 1 8 188 85 9 0 0]
[ 0 5 65 318 49 3 0]
[ 0 1 6 62 102 4 1]
[ 0 0 1 6 8 20 0]
[ 0 0 0 0 1 0 0]]
Accuracy on Test Data: 0.6530612244897959
Precision on Test Data: 0.6502442182752354
Recall on Test Data: 0.6530612244897959
F1-score on Test Data: 0.650376342395648
After tuning the hyperparameters and refitting with the adjusted sample weights, test accuracy reached 65.3% and the weighted F1 score 65.0% (adjusting the sample weights improved performance from 60.6% to 65.3%).
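The `weight` vector used throughout this section was constructed earlier in the notebook; one standard way to build such class-balancing sample weights is sklearn's `compute_sample_weight`. A sketch on toy labels (not the project's actual weight definition):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced toy quality labels: classes 5 and 6 dominate, as in the wine data
y_train_toy = np.array([5] * 8 + [6] * 9 + [4] * 2 + [7] * 1)

# 'balanced' assigns each observation n_samples / (n_classes * class_count),
# so rare classes (4 and 7 here) get larger weights than common ones (5 and 6)
weight_toy = compute_sample_weight(class_weight='balanced', y=y_train_toy)
print(np.round(weight_toy, 3))
```

These weights can then be passed as `sample_weight` to `fit`, exactly as done with `xgb_model.fit(X_train, y_train, sample_weight=weight)` above.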
pd.concat([pd.Series(X_train.columns, name='predictor'),
           pd.Series(xgb_model.feature_importances_, name='importance')],
          axis=1).sort_values(by='importance', ascending=False)
|    | predictor            | importance |
|----|----------------------|------------|
| 10 | alcohol              | 0.154100   |
| 0  | fixed acidity        | 0.138364   |
| 5  | free sulfur dioxide  | 0.102930   |
| 4  | chlorides            | 0.099804   |
| 8  | pH                   | 0.087954   |
| 1  | volatile acidity     | 0.083195   |
| 7  | density              | 0.077974   |
| 3  | residual sugar       | 0.076811   |
| 6  | total sulfur dioxide | 0.073330   |
| 2  | citric acid          | 0.058400   |
| 9  | sulphates            | 0.047138   |
Alcohol is the most important predictor for the XGBoost model.
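Gain-based importances from tree boosters can be biased toward high-cardinality or correlated features; permutation importance is a complementary, model-agnostic check that measures the score drop when each feature is shuffled on held-out data. A self-contained sketch on synthetic data (not the wine dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times on the test set and record the mean score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                scoring='f1_weighted', random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]  # features, most important first
print(ranking)
```

If the permutation ranking broadly agrees with the gain-based ranking (as with alcohol here), that strengthens the conclusion.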
4 Model Ensemble
4.1 Voting ensemble - Hard Voting
By Aarti Pappu
Ensembling all four previously developed models (decision tree, bagging, random forest, and XGBoost):
# Voting ensemble: hard (majority) vote over the predictions of all four models
voting_ensemble = VotingClassifier(estimators=[('dt', dt_model), ('bg', bg_model), ('rf', rf_model), ('xgb', xgb_model)])
voting_ensemble.fit(X_train, y_train)
voting_pred = voting_ensemble.predict(X_test)
print(classification_report(y_test, voting_pred, digits=3))
print(confusion_matrix(y_test, voting_pred))
              precision    recall  f1-score   support
3 0.000 0.000 0.000 4
4 0.615 0.242 0.348 33
5 0.684 0.677 0.680 291
6 0.634 0.789 0.703 440
7 0.717 0.460 0.561 176
8 1.000 0.543 0.704 35
9 0.000 0.000 0.000 1
accuracy 0.665 980
macro avg 0.522 0.387 0.428 980
weighted avg 0.673 0.665 0.655 980
[[ 0 0 1 3 0 0 0]
[ 0 8 17 8 0 0 0]
[ 0 3 197 91 0 0 0]
[ 0 1 68 347 24 0 0]
[ 0 1 4 90 81 0 0]
[ 0 0 1 8 7 19 0]
[ 0 0 0 0 1 0 0]]
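The ensemble above uses hard voting, where each model casts one vote for a class label. An alternative is soft voting, which averages the predicted class probabilities and can be more accurate when the base models' probability estimates are well calibrated. A minimal self-contained sketch on synthetic data (the estimators here are stand-ins, not the project's fitted models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# voting='soft' averages predict_proba outputs instead of counting label votes;
# all base estimators must therefore support predict_proba
soft = VotingClassifier(estimators=[('dt', DecisionTreeClassifier(random_state=1)),
                                    ('rf', RandomForestClassifier(random_state=1)),
                                    ('lr', LogisticRegression(max_iter=1000))],
                        voting='soft')
soft.fit(X_tr, y_tr)
print(soft.score(X_te, y_te))
```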
A logistic model was added to see if it would decorrelate the predictions (since our models are all tree-based) and increase the precision, recall, and/or accuracy.
log_model = LogisticRegression(solver = 'newton-cg', max_iter = 1000).fit(X_train, y_train)
# Voting ensemble: hard (majority) vote, now including the logistic regression model
voting_ensemble = VotingClassifier(estimators=[('dt', dt_model), ('bg', bg_model), ('rf', rf_model), ('xgb', xgb_model), ('log', log_model)])
voting_ensemble.fit(X_train, y_train)
voting_pred = voting_ensemble.predict(X_test)
print(classification_report(y_test, voting_pred, digits=3))
print(confusion_matrix(y_test, voting_pred))
              precision    recall  f1-score   support
3 0.000 0.000 0.000 4
4 0.615 0.242 0.348 33
5 0.701 0.653 0.676 291
6 0.641 0.811 0.716 440
7 0.708 0.483 0.574 176
8 1.000 0.543 0.704 35
9 0.000 0.000 0.000 1
accuracy 0.672 980
macro avg 0.524 0.390 0.431 980
weighted avg 0.680 0.672 0.662 980
[[ 0 0 1 3 0 0 0]
[ 0 8 17 8 0 0 0]
[ 0 4 190 97 0 0 0]
[ 0 1 56 357 26 0 0]
[ 0 0 6 85 85 0 0]
[ 0 0 1 7 8 19 0]
[ 0 0 0 0 1 0 0]]
Adding the logistic regression model slightly improved the voting ensemble's weighted precision (0.673 to 0.680) and accuracy (0.665 to 0.672).
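One way to check the decorrelation intuition directly is to measure how often each pair of base models agrees on the test set: pairs that agree less often contribute more diversity to the vote. A self-contained sketch on synthetic data (the models and data here are stand-ins for the project's fitted models):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

models = {'dt': DecisionTreeClassifier(random_state=1),
          'rf': RandomForestClassifier(random_state=1),
          'log': LogisticRegression(max_iter=1000)}
preds = {name: m.fit(X_tr, y_tr).predict(X_te) for name, m in models.items()}

# Fraction of test points on which each pair of models predicts the same label
for a, b in combinations(preds, 2):
    print(a, b, np.mean(preds[a] == preds[b]).round(3))
```

Lower agreement between the logistic model and the tree-based models would support the decorrelation explanation for the precision gain.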
4.2 Stacking ensemble(s)
By Aarti Pappu
# Using logistic regression as the meta-model (final_estimator)
stacking_ensemble = StackingClassifier(estimators=[('dt', dt_model), ('bg', bg_model), ('rf', rf_model), ('xgb', xgb_model)],
                                       final_estimator=LogisticRegression(random_state=1, max_iter=10000), n_jobs=-1,
                                       cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=1))
stacking_ensemble.fit(X_train, y_train)
stacking_pred = stacking_ensemble.predict(X_test)
print(classification_report(y_test, stacking_pred, digits=3))
print(confusion_matrix(y_test, stacking_pred))
              precision    recall  f1-score   support
3 0.000 0.000 0.000 4
4 0.688 0.333 0.449 33
5 0.706 0.653 0.679 291
6 0.637 0.789 0.705 440
7 0.687 0.511 0.586 176
8 1.000 0.543 0.704 35
9 0.000 0.000 0.000 1
accuracy 0.670 980
macro avg 0.531 0.404 0.446 980
weighted avg 0.678 0.670 0.663 980
[[ 0 0 2 2 0 0 0]
[ 0 11 15 7 0 0 0]
[ 0 3 190 98 0 0 0]
[ 0 2 59 347 32 0 0]
[ 0 0 3 83 90 0 0]
[ 0 0 0 8 8 19 0]
[ 0 0 0 0 1 0 0]]
As with the voting ensemble, a logistic regression model was added as a base estimator to see whether it would decorrelate the tree-based predictions and improve precision, recall, and/or accuracy.
# Using logistic regression as the meta-model (final_estimator), with log_model among the base estimators
stacking_ensemble = StackingClassifier(estimators=[('dt', dt_model), ('bg', bg_model), ('rf', rf_model), ('xgb', xgb_model), ('log', log_model)],
                                       final_estimator=LogisticRegression(random_state=1, max_iter=10000), n_jobs=-1,
                                       cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=1))
stacking_ensemble.fit(X_train, y_train)
stacking_pred = stacking_ensemble.predict(X_test)
print(classification_report(y_test, stacking_pred, digits=3))
print(confusion_matrix(y_test, stacking_pred))
              precision    recall  f1-score   support
3 0.000 0.000 0.000 4
4 0.647 0.333 0.440 33
5 0.703 0.660 0.681 291
6 0.636 0.780 0.701 440
7 0.682 0.511 0.584 176
8 1.000 0.543 0.704 35
9 0.000 0.000 0.000 1
accuracy 0.668 980
macro avg 0.524 0.404 0.444 980
weighted avg 0.675 0.668 0.662 980
[[ 0 0 2 2 0 0 0]
[ 0 11 15 7 0 0 0]
[ 0 4 192 95 0 0 0]
[ 0 2 61 343 34 0 0]
[ 0 0 3 83 90 0 0]
[ 0 0 0 9 7 19 0]
[ 0 0 0 0 1 0 0]]
Including the logistic regression model as a base estimator slightly worsened the stacked ensemble's overall performance (accuracy 0.670 to 0.668), so it was excluded from the final stacked ensemble:
# Final stacking ensemble model (logistic regression excluded from the base estimators)
stacking_ensemble = StackingClassifier(estimators=[('dt', dt_model), ('bg', bg_model), ('rf', rf_model), ('xgb', xgb_model)],
                                       final_estimator=LogisticRegression(random_state=1, max_iter=10000), n_jobs=-1,
                                       cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=1))
stacking_ensemble.fit(X_train, y_train)
stacking_pred = stacking_ensemble.predict(X_test)
4.3 Ensemble of ensembled models (using stacking ensemble)
By Aarti Pappu
# Stacking the voting and stacking ensembles, with logistic regression as the meta-model
ensemble_model = StackingClassifier(estimators=[('voting', voting_ensemble), ('stacking', stacking_ensemble)],
                                    final_estimator=LogisticRegression(random_state=1, max_iter=10000), n_jobs=-1,
                                    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=1))
ensemble_model.fit(X_train, y_train)
ensemble_pred = ensemble_model.predict(X_test)
print(classification_report(y_test, ensemble_pred, digits=3))
print(confusion_matrix(y_test, ensemble_pred))
              precision    recall  f1-score   support
3 0.000 0.000 0.000 4
4 0.692 0.273 0.391 33
5 0.703 0.674 0.688 291
6 0.646 0.789 0.710 440
7 0.689 0.517 0.591 176
8 1.000 0.543 0.704 35
9 0.000 0.000 0.000 1
accuracy 0.676 980
macro avg 0.533 0.399 0.441 980
weighted avg 0.682 0.676 0.668 980
[[ 0 0 2 2 0 0 0]
[ 0 9 17 7 0 0 0]
[ 0 3 196 91 1 0 0]
[ 0 1 61 347 31 0 0]
[ 0 0 3 82 91 0 0]
[ 0 0 0 8 8 19 0]
[ 0 0 0 0 1 0 0]]
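To compare the candidate ensembles at a glance, the per-model test metrics can be collected into a single table. A self-contained sketch of the pattern on synthetic data (stand-in models, not the fitted ensembles above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

base = [('dt', DecisionTreeClassifier(random_state=1)),
        ('rf', RandomForestClassifier(random_state=1))]
candidates = {'voting': VotingClassifier(estimators=base),
              'stacking': StackingClassifier(estimators=base,
                                             final_estimator=LogisticRegression(max_iter=1000))}

# Fit each candidate once and record its test metrics in one row
rows = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({'model': name,
                 'accuracy': accuracy_score(y_te, pred),
                 'f1_weighted': f1_score(y_te, pred, average='weighted')})
summary = pd.DataFrame(rows).set_index('model')
print(summary)
```

Applied to the project's `voting_ensemble`, `stacking_ensemble`, and `ensemble_model` with `X_test`/`y_test`, this would consolidate the scattered printouts above into one comparison table.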